Reinforcement Learning with Replacing Eligibility Traces
Authors
Abstract
The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well-known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that the method corresponding to replacing traces is closely related to the maximum likelihood solution for these tasks, and that its mean squared error is always lower in the long run. Computational results confirm these analyses and show that they are applicable more generally. In particular, we show that replacing traces significantly improve performance and reduce parameter sensitivity on the "Mountain-Car" task, a full reinforcement-learning problem with a continuous state space, when using a feature-based function approximator.

Two fundamental mechanisms have been used in reinforcement learning to handle delayed reward. One is temporal-difference (TD) learning, as in the TD(λ) algorithm (Sutton, 1988) and in Q-learning (Watkins, 1989). TD learning in effect constructs an internal reward signal that is less delayed than the original, external one. However, TD methods can eliminate the delay completely only on fully Markov problems, which are rare in practice. In most problems some delay always remains between an action and its effective reward, and on all problems some delay is always present during the time before TD learning is complete. Thus, there is a general need for a second mechanism to handle whatever delay is not eliminated by TD learning. The second mechanism that has been widely used for handling delay is the eligibility trace. Introduced by Klopf (1972), eligibility traces have been used in a variety of reinforcement learning systems (e.g., …). Systematic empirical studies of eligibility traces in conjunction with TD methods were made by Sutton (1984), and theoretical …
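The distinction between the two kinds of trace can be made concrete with a small sketch. The following is a minimal, illustrative tabular online TD(λ) prediction update, not the paper's offline TD(1) analysis; the `replacing` flag switches between an accumulating trace, which increments on every visit, and a replacing trace, which is reset to 1. All names and parameter values are assumptions for illustration.

```python
# Minimal sketch (assumed names/parameters): tabular online TD(lambda)
# prediction with either accumulating or replacing eligibility traces.
import numpy as np

def td_lambda_episode(episode, value, step_size=0.1, gamma=1.0,
                      lam=0.9, replacing=True):
    """Update the value array in place from one episode of
    (state, reward, next_state) transitions; next_state is None at termination."""
    trace = np.zeros_like(value)
    for state, reward, next_state in episode:
        if replacing:
            trace[state] = 1.0      # replacing trace: reset to 1 on every visit
        else:
            trace[state] += 1.0     # accumulating trace: repeated visits add up
        next_value = 0.0 if next_state is None else value[next_state]
        td_error = reward + gamma * next_value - value[state]
        value += step_size * td_error * trace  # credit all recently visited states
        trace *= gamma * lam                   # decay every trace
    return value

# Example: a two-state chain 0 -> 1 -> terminal, with rewards 0 and 1.
values = td_lambda_episode([(0, 0.0, 1), (1, 1.0, None)], np.zeros(2))
```

On a chain where a state is visited several times in one episode, only the accumulating variant lets that state's trace grow above 1; that extra credit to repeated events is the difference the abstract identifies between the two traces.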
Similar Papers
Replacing eligibility trace for action-value learning with function approximation
The eligibility trace is one of the most widely used mechanisms for speeding up reinforcement learning. Earlier reported experiments seem to indicate that replacing eligibility traces perform better than accumulating eligibility traces. However, replacing traces are currently not applicable when using function approximation methods where states are not represented uniquely by binary values. This pap...
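For reference, the standard form of replacing traces that this abstract builds on applies to linear function approximation with binary features, as in Sarsa(λ) with tile coding. The sketch below is illustrative only (names and parameters are assumptions, not taken from the paper); it shows why the rule "set the trace of each active feature to 1" has no obvious analogue when feature values are not 0/1.

```python
# Illustrative sketch (assumed names): one linear Sarsa(lambda) update with
# replacing traces over binary features (e.g. tile coding). With non-binary
# features the "set active traces to 1" step has no obvious analogue, which
# is the limitation the abstract refers to.
import numpy as np

def sarsa_lambda_step(weights, trace, active, reward, next_active,
                      step_size=0.1, gamma=1.0, lam=0.9, terminal=False):
    """`active` and `next_active` are index arrays of the binary features
    that are 1 for the current and next state-action pair."""
    trace[active] = 1.0                                  # replacing trace
    q = weights[active].sum()                            # linear value estimate
    q_next = 0.0 if terminal else weights[next_active].sum()
    td_error = reward + gamma * q_next - q
    weights += step_size * td_error * trace              # update all traced weights
    trace *= gamma * lam                                  # decay for the next step
    return weights, trace
```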
Bidding Strategy on Demand Side Using Eligibility Traces Algorithm
Restructuring in the power industry is followed by splitting its different parts and creating competition between the purchasing and selling sections. As a consequence, through active participation in the energy market, service provider companies and large consumers create a context for overcoming the problems resulting from the lack of demand-side participation in the market. The most prominent ch...
Using Sliding Mode Controller and Eligibility Traces for Controlling the Blood Glucose in Diabetic Patients at the Presence of Fault
Some people suffering from diabetes use insulin injection pumps to control their blood glucose level. Sometimes a fault may occur in the sensor or actuator of these pumps. The main objective of this paper is to control the blood glucose level at the desired level and to provide fault-tolerant control of these injection pumps. To this end, the eligibility traces algorithm is combined with the sliding mod...
A Logarithmic-time Updating Algorithm for TD(λ) Learning
The temporal-difference (TD) method is an incremental learning method for long-term prediction problems. Most reinforcement learning methods are based on it. To cope with partial observability, it has to be combined with the idea of eligibility traces, which raises the issue of time complexity. There are some conventional ways to reduce it, which are unavailable in environments where there may...
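The time-complexity issue mentioned here arises because a naive TD(λ) step decays the eligibility trace of every state. A common workaround, sketched below purely for illustration (it is not necessarily this article's logarithmic-time algorithm, and all names are assumptions), is to store only the significant traces in a dictionary and drop those that fall below a cutoff, so each step touches only recently visited states.

```python
# Illustrative sketch (assumed names), not necessarily the article's method:
# keep only significant traces in a dict so a TD(lambda) step avoids the
# O(number of states) cost of decaying every trace.
from collections import defaultdict

def td_lambda_sparse_step(value, trace, state, reward, next_state,
                          step_size=0.1, gamma=1.0, lam=0.9, cutoff=1e-4):
    """value: defaultdict(float) of state values; trace: dict of nonzero traces."""
    trace[state] = trace.get(state, 0.0) + 1.0           # accumulating trace
    next_value = 0.0 if next_state is None else value[next_state]
    td_error = reward + gamma * next_value - value[state]
    for s in list(trace):                                 # touch only traced states
        value[s] += step_size * td_error * trace[s]
        trace[s] *= gamma * lam
        if trace[s] < cutoff:
            del trace[s]                                  # truncate negligible traces
    return value, trace

value, trace = defaultdict(float), {}
value, trace = td_lambda_sparse_step(value, trace, state=0, reward=1.0, next_state=None)
```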
The Analysis of Experimental Results of Machine Learning Approach
This article analyzes a reinforcement learning method in which a subject of learning is defined. The essence of this method is the selection of actions by a trial-and-error process and the awarding of deferred rewards. If an environment is characterized by the Markov property, then its step-by-step dynamics enable forecasting of subsequent states and the awarding of subsequent rewards on the basi...